In this project, we will analyze a database containing data on various aspects of residential homes in Ames, Iowa.
Our initial step involves a comprehensive exploratory data analysis to identify potential missing values and outliers. We will make decisions to address these issues.
Secondly, we will conduct a Principal Component Analysis (PCA). This technique aims to condense information from the original variables into a few linear combinations. The objective is to achieve dimensionality reduction while maximizing variance. These linear combinations are designed to be perpendicular to each other, aligning with the directions of maximum variance and ensuring lack of correlation.
Next, we will perform Factor Analysis (FA), identifying latent variables that exhibit a high correlation with specific groups of observable variables and minimal correlation with others. FA facilitates dimensionality reduction by capturing the underlying structure in the data.
In the final stage, we will execute both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). Prior to these analyses, we will verify the necessary normality assumptions. Discriminant Analysis is a classification method for a qualitative response variable: it assigns new observations, based on their characteristics (explanatory or predictor variables), to one of the categories of that response variable.
Data pre-processing
# Package required to call 'freq' and 'descr' functions (descriptive statistics)
library(summarytools)
# Package required to call 'ggplot' function (graphical tools)
library(ggplot2)
# Package required to call 'read.spss' function (loading '.spss' data format)
library(foreign)
# Package required to call 'read_xlsx' function (loading '.xlsx' data format)
library(readxl)
# Package required to load the data set 'RBGlass1'
library(archdata)
library(ggpubr)
# Package required to call 'fviz_pca_var, fviz_pca_ind and fviz_pca' functions
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Package required to call 'scatterplot3d' function
library(scatterplot3d)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
The file train.csv has the following data:
GrLivArea: Above grade (ground) living area square feet
GarageArea: Area of the garage
X1stFlrSF: Area of the first floor
LotArea: Lot size in square feet
OverallQual: Rates the overall material and finish of the house
YearBuilt: Year of construction
SalePrice: Sale price
The data file has a .csv extension, so the read.csv function is used to load the file.
data <- read.csv("train.csv")
Select the columns mentioned before:
data <- data[c("GrLivArea","GarageArea", "X1stFlrSF", "LotArea", "OverallQual", "YearBuilt", "SalePrice")]
colnames(data)
## [1] "GrLivArea" "GarageArea" "X1stFlrSF" "LotArea" "OverallQual"
## [6] "YearBuilt" "SalePrice"
Initially, we examine whether the database contains missing values. To accomplish this, we create the following function to calculate the percentage of missing values in a vector, and then apply it to every column of the data frame with the apply function.
percentageNA <- function(data) {
# Percentage of missing values in a vector
c <- sum(is.na(data)) / length(data) * 100
return(c)
}
contNA<-apply(data,2,percentageNA)
contNA
## GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt
## 0 0 0 0 0 0
## SalePrice
## 0
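The same percentages can be obtained in a single line with colMeans (an equivalent sketch):

```r
# is.na(data) is a logical matrix; colMeans averages each column,
# giving the proportion of missing values per variable
colMeans(is.na(data)) * 100
```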
We don’t have missing data, so we can proceed directly to the next step.
We will conduct a preliminary exploratory data analysis of the dataset. As the variables are quantitative, we will present basic numerical descriptive statistics along with visual representations such as histograms, density plots, and boxplots.
for (col in c("GrLivArea","GarageArea", "X1stFlrSF", "LotArea", "OverallQual", "YearBuilt", "SalePrice")) {
print(descr(data[col]))
}
## Descriptive Statistics
## data$GrLivArea
## N: 1460
##
## GrLivArea
## ----------------- -----------
## Mean 1515.46
## Std.Dev 525.48
## Min 334.00
## Q1 1129.00
## Median 1464.00
## Q3 1777.50
## Max 5642.00
## MAD 483.33
## IQR 647.25
## CV 0.35
## Skewness 1.36
## SE.Skewness 0.06
## Kurtosis 4.86
## N.Valid 1460.00
## Pct.Valid 100.00
## Descriptive Statistics
## data$GarageArea
## N: 1460
##
## GarageArea
## ----------------- ------------
## Mean 472.98
## Std.Dev 213.80
## Min 0.00
## Q1 333.00
## Median 480.00
## Q3 576.00
## Max 1418.00
## MAD 177.91
## IQR 241.50
## CV 0.45
## Skewness 0.18
## SE.Skewness 0.06
## Kurtosis 0.90
## N.Valid 1460.00
## Pct.Valid 100.00
## Descriptive Statistics
## data$X1stFlrSF
## N: 1460
##
## X1stFlrSF
## ----------------- -----------
## Mean 1162.63
## Std.Dev 386.59
## Min 334.00
## Q1 882.00
## Median 1087.00
## Q3 1391.50
## Max 4692.00
## MAD 347.67
## IQR 509.25
## CV 0.33
## Skewness 1.37
## SE.Skewness 0.06
## Kurtosis 5.71
## N.Valid 1460.00
## Pct.Valid 100.00
## Descriptive Statistics
## data$LotArea
## N: 1460
##
## LotArea
## ----------------- -----------
## Mean 10516.83
## Std.Dev 9981.26
## Min 1300.00
## Q1 7549.00
## Median 9478.50
## Q3 11603.00
## Max 215245.00
## MAD 2962.23
## IQR 4048.00
## CV 0.95
## Skewness 12.18
## SE.Skewness 0.06
## Kurtosis 202.26
## N.Valid 1460.00
## Pct.Valid 100.00
## Descriptive Statistics
## data$OverallQual
## N: 1460
##
## OverallQual
## ----------------- -------------
## Mean 6.10
## Std.Dev 1.38
## Min 1.00
## Q1 5.00
## Median 6.00
## Q3 7.00
## Max 10.00
## MAD 1.48
## IQR 2.00
## CV 0.23
## Skewness 0.22
## SE.Skewness 0.06
## Kurtosis 0.09
## N.Valid 1460.00
## Pct.Valid 100.00
## Descriptive Statistics
## data$YearBuilt
## N: 1460
##
## YearBuilt
## ----------------- -----------
## Mean 1971.27
## Std.Dev 30.20
## Min 1872.00
## Q1 1954.00
## Median 1973.00
## Q3 2000.00
## Max 2010.00
## MAD 37.06
## IQR 46.00
## CV 0.02
## Skewness -0.61
## SE.Skewness 0.06
## Kurtosis -0.45
## N.Valid 1460.00
## Pct.Valid 100.00
## Descriptive Statistics
## data$SalePrice
## N: 1460
##
## SalePrice
## ----------------- -----------
## Mean 180921.20
## Std.Dev 79442.50
## Min 34900.00
## Q1 129950.00
## Median 163000.00
## Q3 214000.00
## Max 755000.00
## MAD 56338.80
## IQR 84025.00
## CV 0.44
## Skewness 1.88
## SE.Skewness 0.06
## Kurtosis 6.50
## N.Valid 1460.00
## Pct.Valid 100.00
attach(data)
p1<-ggplot(data,aes(x=GrLivArea))+geom_density()+
labs(title = paste("Density function of", "GrLivArea"),x="GrLivArea",y="Values")
p2<-ggplot(data,aes(x=GrLivArea))+geom_histogram()+
labs(title = paste("Histogram of", "GrLivArea"), x="GrLivArea",y="Values")
p3<-ggplot(data,aes(x=GrLivArea))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "GrLivArea"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1<-ggplot(data,aes(x=GarageArea))+geom_density()+
labs(title = paste("Density function of", "GarageArea"),x="GarageArea",y="Values")
p2<-ggplot(data,aes(x=GarageArea))+geom_histogram()+
labs(title = paste("Histogram of", "GarageArea"), x="GarageArea",y="Values")
p3<-ggplot(data,aes(x=GarageArea))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "GarageArea"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1<-ggplot(data,aes(x=X1stFlrSF))+geom_density()+
labs(title = paste("Density function of", "X1stFlrSF"),x="X1stFlrSF",y="Values")
p2<-ggplot(data,aes(x=X1stFlrSF))+geom_histogram()+
labs(title = paste("Histogram of", "X1stFlrSF"), x="X1stFlrSF",y="Values")
p3<-ggplot(data,aes(x=X1stFlrSF))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "X1stFlrSF"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1<-ggplot(data,aes(x=LotArea))+geom_density()+
labs(title = paste("Density function of", "LotArea"),x="LotArea",y="Values")
p2<-ggplot(data,aes(x=LotArea))+geom_histogram()+
labs(title = paste("Histogram of", "LotArea"), x="LotArea",y="Values")
p3<-ggplot(data,aes(x=LotArea))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "LotArea"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1<-ggplot(data,aes(x=OverallQual))+geom_density()+
labs(title = paste("Density function of", "OverallQual"),x="OverallQual",y="Values")
p2<-ggplot(data,aes(x=OverallQual))+geom_histogram()+
labs(title = paste("Histogram of", "OverallQual"), x="OverallQual",y="Values")
p3<-ggplot(data,aes(x=OverallQual))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "OverallQual"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1<-ggplot(data,aes(x=YearBuilt))+geom_density()+
labs(title = paste("Density function of", "YearBuilt"),x="YearBuilt",y="Values")
p2<-ggplot(data,aes(x=YearBuilt))+geom_histogram()+
labs(title = paste("Histogram of", "YearBuilt"), x="YearBuilt",y="Values")
p3<-ggplot(data,aes(x=YearBuilt))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "YearBuilt"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1<-ggplot(data,aes(x=SalePrice))+geom_density()+
labs(title = paste("Density function of", "SalePrice"),x="SalePrice",y="Values")
p2<-ggplot(data,aes(x=SalePrice))+geom_histogram()+
labs(title = paste("Histogram of", "SalePrice"), x="SalePrice",y="Values")
p3<-ggplot(data,aes(x=SalePrice))+
geom_boxplot(outlier.colour="red", outlier.shape=1,outlier.size=2)+
coord_flip()+labs(title = paste("Boxplot of", "SalePrice"),x="Values",y="")
# This function controls the graphical output device
# Remember that package 'ggpubr' is required
ggarrange(p1,p2,p3, nrow=1, common.legend = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
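The seven blocks above repeat the same three plots for each variable; an equivalent loop (a sketch using ggplot2's .data pronoun, available in recent ggplot2 versions) produces the same panels with less repetition:

```r
for (v in colnames(data)) {
  p1 <- ggplot(data, aes(x = .data[[v]])) + geom_density() +
    labs(title = paste("Density function of", v), x = v, y = "Density")
  p2 <- ggplot(data, aes(x = .data[[v]])) + geom_histogram() +
    labs(title = paste("Histogram of", v), x = v, y = "Count")
  p3 <- ggplot(data, aes(x = .data[[v]])) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
    coord_flip() + labs(title = paste("Boxplot of", v), x = "Values", y = "")
  # ggarrange from package 'ggpubr', as above
  print(ggarrange(p1, p2, p3, nrow = 1, common.legend = FALSE))
}
```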
To identify outliers, a prior graphical exploratory analysis must be conducted by creating boxplots for all variables.
Since all variables are quantitative, the decision has been made to replace outliers with the mean. To achieve this, the outlier function has been developed, which detects and manipulates outliers. It is worth noting that the decision to replace outliers with the mean would require a prior analysis of the cause of these atypical values.
boxplot(data,main="Exploratory data analysis",
xlab="House features",
ylab="Value",
col=c(1:ncol(data)))
outlier <- function(data, na.rm = TRUE) {
# Flag values beyond 1.5*IQR from the quartiles as NA...
H <- 1.5 * IQR(data, na.rm = na.rm)
data[data < quantile(data, 0.25, na.rm = na.rm) - H] <- NA
data[data > quantile(data, 0.75, na.rm = na.rm) + H] <- NA
# ...and replace them with the mean of the remaining values
data[is.na(data)] <- mean(data, na.rm = TRUE)
# Replacing outliers shifts the quartiles, so new outliers may appear;
# recurse until no value lies outside the fences
H <- 1.5 * IQR(data)
if (any(data < quantile(data, 0.25) - H) |
any(data > quantile(data, 0.75) + H))
outlier(data)
else
return(data)
}
data<-as.data.frame(apply(data,2,outlier))
par(mfrow=c(1,2))
# Fixed data
boxplot(data,main="Data without outliers",
xlab="House features",
ylab="Values",
col=c(1:ncol(data)))
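To confirm that the recursive treatment converged, one can count how many values still fall outside the 1.5·IQR fences in each column (a sketch; by construction of the recursion, every count should be zero):

```r
count_fence_violations <- function(x) {
  # Number of values outside the 1.5*IQR fences of a vector
  H <- 1.5 * IQR(x)
  sum(x < quantile(x, 0.25) - H | x > quantile(x, 0.75) + H)
}
apply(data, 2, count_fence_violations)
```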
The analysis of outliers is repeated below, but this time with normalized data to enhance the visualization of all outliers.
normalized_data<-scale(data)
# The plot shows that many variables present outliers.
boxplot(normalized_data,main="Exploratory data analysis",
xlab="House features",
ylab="Value",
col=c(1:ncol(data)))
Firstly, it is necessary to verify that the variables are not independent. At the sample level, this can be checked by computing and inspecting the correlation matrix. At the population level, correlation can be justified by Bartlett's sphericity test, which checks whether the correlations differ significantly from 0; the null hypothesis is that the correlation matrix \(R\) is the identity matrix (equivalently, that \(det(R) = 1\)). The following code performs both checks.
cor(data)
## GrLivArea GarageArea X1stFlrSF LotArea OverallQual
## GrLivArea 1.000000000 0.41091556 0.41252812 0.003349105 0.53890566
## GarageArea 0.410915560 1.00000000 0.46060093 0.036038044 0.54785260
## X1stFlrSF 0.412528115 0.46060093 1.00000000 0.048127106 0.42530933
## LotArea 0.003349105 0.03603804 0.04812711 1.000000000 0.06670334
## OverallQual 0.538905657 0.54785260 0.42530933 0.066703337 1.00000000
## YearBuilt 0.251413273 0.48573031 0.28939644 -0.017380668 0.59009075
## SalePrice 0.544805293 0.48223074 0.39358026 -0.008347953 0.58579926
## YearBuilt SalePrice
## GrLivArea 0.25141327 0.544805293
## GarageArea 0.48573031 0.482230739
## X1stFlrSF 0.28939644 0.393580261
## LotArea -0.01738067 -0.008347953
## OverallQual 0.59009075 0.585799260
## YearBuilt 1.00000000 0.541428580
## SalePrice 0.54142858 1.000000000
normalized_data<-scale(data)
cortest.bartlett(cor(normalized_data))
## Warning in cortest.bartlett(cor(normalized_data)): n not specified, 100 used
## $chisq
## [1] 218.4913
##
## $p.value
## [1] 8.031927e-35
##
## $df
## [1] 21
In light of the test results \((p < 0.001)\), the null hypothesis of uncorrelated variables is rejected: the variables are significantly correlated. Therefore, it makes sense to consider dimensionality reduction through Principal Component Analysis (PCA) or Factor Analysis (FA).
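Note the warning above: cortest.bartlett was not given the sample size and defaulted to n = 100. Supplying the actual number of observations yields the statistic at the correct sample size (a sketch):

```r
# Pass the real sample size (1460 rows here) so the chi-square statistic
# and p-value are computed for the data at hand rather than the default n = 100
cortest.bartlett(cor(normalized_data), n = nrow(data))
```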
Other ways to observe that the variables are correlated, which will be useful for Factor Analysis, include:
#install.packages("polycor")
suppressMessages(library(polycor))
## Warning: package 'polycor' was built under R version 4.3.2
#install.packages("ggcorrplot")
suppressMessages(library(ggcorrplot))
## Warning: package 'ggcorrplot' was built under R version 4.3.2
poly_cor<-hetcor(data)$correlations
ggcorrplot(poly_cor, type="lower",hc.order=T)
#install.packages("corrplot")
suppressMessages(library(corrplot))
corrplot(cor(data), order = "hclust", tl.col='black', tl.cex=1)
#install.packages("corrr")
suppressMessages(library(corrr))
## Warning: package 'corrr' was built under R version 4.3.2
data_correlations <- correlate(data) # Compute a correlation object
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
rplot(data_correlations, legend = TRUE, colours = c("firebrick1", "black","darkcyan"), print_cor = TRUE)
Based on the results obtained in the previous sections, it is now possible to proceed with dimensionality reduction through either Principal Component Analysis or Factor Analysis. The data exhibit correlations, there are no missing values, and the presence of outliers has been addressed.
The following code performs Principal Component Analysis (PCA), obtaining the eigenvectors that generate each component, as well as their eigenvalues, which correspond to the variance of each component.
PCA<-prcomp(data, scale=T, center = T)
PCA$rotation
## PC1 PC2 PC3 PC4 PC5
## GrLivArea -0.3845483 -0.007471144 0.53047508 0.52719642 0.17343324
## GarageArea -0.4155496 0.027712787 -0.06793035 -0.40089276 0.77286107
## X1stFlrSF -0.3547808 0.111094486 0.49421252 -0.63164333 -0.46533385
## LotArea -0.0228391 0.981395441 -0.13124212 0.10059078 -0.02738571
## OverallQual -0.4582233 0.033760363 -0.14708590 0.15744458 0.01843438
## YearBuilt -0.3879810 -0.121223988 -0.65348617 -0.09627253 -0.26231022
## SalePrice -0.4388077 -0.088668881 -0.06190899 0.34380218 -0.29355517
## PC6 PC7
## GrLivArea 0.14794959 -0.49067171
## GarageArea -0.25213995 0.01822619
## X1stFlrSF 0.04024664 -0.02034391
## LotArea -0.06741260 -0.06085992
## OverallQual 0.64591076 0.57002461
## YearBuilt 0.11046995 -0.56342936
## SalePrice -0.69207499 0.33527687
PCA$sdev
## [1] 1.8282024 1.0058011 0.8989781 0.8150016 0.6918485 0.6343287 0.5409468
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8282 1.0058 0.8990 0.81500 0.69185 0.63433 0.5409
## Proportion of Variance 0.4775 0.1445 0.1154 0.09489 0.06838 0.05748 0.0418
## Cumulative Proportion 0.4775 0.6220 0.7375 0.83234 0.90071 0.95820 1.0000
The following graphs illustrate the behavior of the variance explained by each principal component, as well as the cumulative explained variance.
suppressMessages(library(ggplot2))
# Explained variance
explained_var <- PCA$sdev^2 / sum(PCA$sdev^2)
ggplot(data = data.frame(explained_var, pc = 1:ncol(data)),
aes(x = pc, y = explained_var, fill=explained_var )) +
geom_col(width = 0.3) +
scale_y_continuous(limits = c(0,0.6)) + theme_bw() +
labs(x = "Principal component", y= " Proportion of the explained variance")
accum_var<-cumsum(explained_var)
ggplot( data = data.frame(accum_var, pc = 1:ncol(data)),
aes(x = pc, y = accum_var ,fill=accum_var )) +
geom_col(width = 0.5) +
scale_y_continuous(limits = c(0,1)) +
theme_bw() +
labs(x = "Principal component",
y = "Proportion of the accumulated explained variance")
To choose the appropriate number of principal components, the following criterion is employed: the variances explained by the principal components are averaged, and those components whose explained variance exceeds that mean are selected.
PCA$sdev^2
## [1] 3.3423241 1.0116359 0.8081616 0.6642276 0.4786544 0.4023729 0.2926235
mean(PCA$sdev^2)
## [1] 1
By the mean-eigenvalue rule (Abdi), components whose variance exceeds the mean of the eigenvalues (here 1, since the data are standardized) are retained, so 2 components are chosen. Each principal component is obtained as the linear combination of all variables, with the coefficients shown in the columns of the rotation matrix.
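With two components retained, the coordinates of each house on those components (the scores) are available in PCA$x and could replace the original seven variables in later analyses (a sketch):

```r
# Scores on the first two principal components, one row per house
scores <- PCA$x[, 1:2]
head(scores)
dim(scores)  # 1460 observations, 2 components
```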
The following graphs display a comparison between different principal components through a two-dimensional projection. In this representation, it is evident which of the original variables carry greater or lesser weight in each of the juxtaposed components.
suppressMessages(library("factoextra"))
# Comparison between the 1st and 2nd principal components.
fviz_pca_var(PCA,
repel=TRUE,col.var="cos2",
legend.title="Distance")+theme_bw()
fviz_pca_var(PCA,axes=c(1,3),
repel=TRUE,col.var="cos2",
legend.title="Distance")+theme_bw()
fviz_pca_var(PCA,axes=c(2,3),
repel=TRUE,col.var="cos2",
legend.title="Distance")+theme_bw()
It is also possible to represent the individual observations along the principal components using the “contrib” argument of the fviz_pca_ind function, coloring each observation according to its contribution to the components.
fviz_pca_ind(PCA,col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel=TRUE,legend.title="Contrib.var")+theme_bw()
To extract the factors, an estimation method must be chosen. The fa() function is used for this purpose, as it implements several different estimation methods.
Next, the outputs are compared using the principal factor method and the maximum likelihood method. Three factors are calculated, as suggested by the correlation graphs, although it is not entirely clear in this example.
# Maximum likelihood method
model1<-fa(poly_cor,
nfactors = 3,
rotate = "none",
fm="mle")
# Minimum residual (minres) method
model2<-fa(poly_cor,
nfactors = 3,
rotate = "none",
fm="minres")
The communalities of both models are compared to see what proportion of the variance is explained by the latent factors.
sort(model1$communality,decreasing = T)->c1
sort(model2$communality,decreasing = T)->c2
head(cbind(c1,c2))
## c1 c2
## YearBuilt 0.8733819 0.8779251
## GrLivArea 0.7817811 0.7882180
## OverallQual 0.6222477 0.6205131
## GarageArea 0.5676882 0.5587136
## SalePrice 0.5566391 0.5586521
## X1stFlrSF 0.3969742 0.4021478
The uniqueness, which represents the proportion of variance not explained by the factor (1-communality), is also compared.
sort(model1$uniquenesses,decreasing = T)->u1
sort(model2$uniquenesses,decreasing = T)->u2
head(cbind(u1,u2))
## u1 u2
## LotArea 0.9833749 0.9853618
## X1stFlrSF 0.6030258 0.5978522
## SalePrice 0.4433609 0.4413479
## GarageArea 0.4323118 0.4412864
## OverallQual 0.3777523 0.3794869
## GrLivArea 0.2182189 0.2117820
We can see that both models are quite similar, since the values obtained are very close.
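As a quick consistency check, each variable's communality and uniqueness should sum to one (a sketch, using model1 and model2 fitted above):

```r
# Communality plus uniqueness equals 1 for every variable
round(model1$communality + model1$uniquenesses, 6)
round(model2$communality + model2$uniquenesses, 6)
```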
Different criteria exist for determining the optimal number of factors. The elbow method is implemented below.
The elbow method involves creating a scree plot, obtained by plotting eigenvalues in decreasing order on the y-axis and the corresponding component numbers for principal components or latent factors (depending on whether PCA or FA is being conducted) on the x-axis. Connecting the points creates a figure that starts with a steep slope and then transitions to a gentler incline. The method suggests selecting the number of factors up to the “elbow” of the plot, where the steep slope gives way to the gentler incline.
It is called a scree plot because it resembles the profile of a mountain with a steep slope leading to the base, where pebbles fallen from the summit accumulate (where they “sediment”).
scree(poly_cor)
fa.parallel(poly_cor,n.obs=200,fa="fa",fm="minres")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
## Parallel analysis suggests that the number of factors = 4 and the number of components = NA
Parallel analysis suggests that the optimal number of factors is 4. Another way to assess this is with the following test:
library(stats)
factanal(data,factors=3,rotation="none") # 4 gives error
##
## Call:
## factanal(x = data, factors = 3, rotation = "none")
##
## Uniquenesses:
## GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt
## 0.218 0.432 0.603 0.983 0.378 0.127
## SalePrice
## 0.443
##
## Loadings:
## Factor1 Factor2 Factor3
## GrLivArea 0.594 0.637 -0.151
## GarageArea 0.659 0.112 0.347
## X1stFlrSF 0.491 0.258 0.298
## LotArea 0.124
## OverallQual 0.774 0.139
## YearBuilt 0.840 -0.404
## SalePrice 0.725 0.172
##
## Factor1 Factor2 Factor3
## SS loadings 2.859 0.699 0.257
## Proportion Var 0.408 0.100 0.037
## Cumulative Var 0.408 0.508 0.545
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 11.85 on 3 degrees of freedom.
## The p-value is 0.0079
The call with 4 factors produces an error (4 factors are too many for 7 variables: the model would have negative degrees of freedom), so 3 is the largest admissible number of factors. Note, however, that the chi-square test above (p = 0.0079) rejects the hypothesis that 3 factors are sufficient, so this choice should be interpreted with caution.
Finally, the factorial model with 3 factors is estimated. To achieve this, a varimax rotation is implemented to seek a simpler interpretation of reality.
varimax_model<-fa(poly_cor,nfactors = 3,rotate = "varimax",
fm="minres") # estimation method is selected with 'fm'
The rotated factorial weight matrix is:
print(varimax_model$loadings,cut=0)
##
## Loadings:
## MR1 MR2 MR3
## GrLivArea 0.150 0.873 0.065
## GarageArea 0.493 0.355 0.435
## X1stFlrSF 0.273 0.393 0.416
## LotArea -0.005 0.003 0.121
## OverallQual 0.573 0.501 0.204
## YearBuilt 0.927 0.132 -0.042
## SalePrice 0.516 0.533 0.093
##
## MR1 MR2 MR3
## SS loadings 1.793 1.594 0.434
## Proportion Var 0.256 0.228 0.062
## Cumulative Var 0.256 0.484 0.546
Visually, one could make an effort to see which variables each of the factors correlates with in the factorial matrix. However, this is quite tedious, so the following representation is used:
fa.diagram(varimax_model)
We can see that the first latent factor correlates with YearBuilt, OverallQual and GarageArea; the second with GrLivArea and SalePrice; and the third with X1stFlrSF.
The database contains information on various indicators measured across different residential homes, including the house price. Therefore, it would be interesting to provide a classification method for the price level based on other indicators.
To achieve this, a qualitative response variable for classification, called price, is defined with two categories:
High price, if the sale price is greater than or equal to the mean.
Low price, if it is below the mean.
price=c()
mean_price = mean(data$SalePrice)
for(k in 1:nrow(data)){
if(data$SalePrice[k]<mean_price){
price[k] = "low"
}
else
price[k] = "high"
}
price<-as.factor(price)
price
## [1] high high high low high low high high low low low high low high
## [15] low low low low high low high low high low low high low high
## [29] high low high low high high high high low low low low high high
## [43] low low low high high high low low high low low high low high
## [57] high high high low high low high low high high high high low high
## [71] high low high low low low low low low low high low high low
## [85] high high high high low low low low high low high high high low
## [99] low low high high low high high high low low low high low high
## [113] high high high high low low high high high low low low high low
## [127] low low low low high high low high high high low high high high
## [141] low high high high low low low high low low low high high high
## [155] low low low high high high high high high low low low high high
## [169] high high low high high high high high high high high low high high
## [183] low high low high high low low high high high high low low low
## [197] high high low high low high low low low high low low high low
## [211] low high high low high low high low high high high high high low
## [225] high low high low low high low high low low high low high high
## [239] high low high low low low high high low low high high low high
## [253] high high low high high high high low high high low low low high
## [267] high high low low high high high low low high high low high high
## [281] high high high high high high high low low low high low low high
## [295] high low low high high high low high high low high high high low
## [309] low high high low low high high high high high high high high high
## [323] high low high low high low high low low low high high high high
## [337] high high high low high low low high low low low low low high
## [351] high high low low low high high low low high low low high low
## [365] high low high high low high high low low low high low low high
## [379] high high low high high low high high low low high high low high
## [393] low low low low low high low high high high low high high low
## [407] low high high high low low high low high high low high low low
## [421] high high low high low low high low high high low low low high
## [435] low high low low low low high low high high high low high high
## [449] low low low high high high high high low high high low high low
## [463] low high low high high low high high high high low high high low
## [477] high high high low high high low high low low low high high low
## [491] low low high low low high high high low low low high low high
## [505] low low high high high low high high low low low high high high
## [519] high high low low high high high high low high low high high low
## [533] low high high low high low high high high high high low high high
## [547] high low low high low low high low high low low low high high
## [561] low high low high high low high high high low low low high high
## [575] low low low high low low high high low high low high low low
## [589] low low high high low low low high low high high low high low
## [603] high low high high low high high low high low high low low low
## [617] high low high high low high low high high high low low low high
## [631] low high low low low high low low low high high high high low
## [645] high low low low low low high low high low high low low low
## [659] low high high high low low high high low high high low high low
## [673] high high low low low low high low low high high high high high
## [687] high low high high low high high low low high low low low high
## [701] high low high low high low high high high low high low high low
## [715] low high high low high low high low low low high low high high
## [729] low low high high high low low high low high high high low low
## [743] high high high high high high high low low high high high low high
## [757] high high high high low low high high high high high high high high
## [771] low low low low high high high low low low high high high high
## [785] low high low high low high high low high high high high low low
## [799] high high high low high high low high low high high low high low
## [813] low low low high low high low high high low high low high high
## [827] low high high low high low high high low low low low low low
## [841] low low high low low high high low high high low high high high
## [855] high low low high low high high low low low high low high low
## [869] high high low high low low low high low high low low low high
## [883] high low low high low low high low low high low high low low
## [897] low low high low low low high high low low high high low high
## [911] low low low low high low high low high high high low high high
## [925] high high high high high high high low high high high low high high
## [939] high high low high low low low low low high high high low low
## [953] low high low low low low high low low high low high high high
## [967] high low high low low high low high high high low high low low
## [981] high high high high low low low high high high high high high high
## [995] high low low high low high low low high low high low high low
## ... (output truncated: 1460 values in total)
## Levels: high low
Firstly, we explore how well (or poorly) each of the independently measured variables classifies the price levels.
suppressMessages(library(ggpubr))
p1 <- ggplot(data = data, aes(x = GrLivArea, fill = price)) +
geom_histogram(position = "identity", alpha = 0.5)
p2 <- ggplot(data = data, aes(x = GarageArea, fill = price)) +
geom_histogram(position = "identity", alpha = 0.5)
p3 <- ggplot(data = data, aes(x = X1stFlrSF, fill = price)) +
geom_histogram(position = "identity", alpha = 0.5)
p4 <- ggplot(data = data, aes(x = LotArea, fill = price)) +
geom_histogram(position = "identity", alpha = 0.5)
p5 <- ggplot(data = data, aes(x = OverallQual, fill = price)) +
geom_histogram(position = "identity", alpha = 0.5)
p6 <- ggplot(data = data, aes(x = YearBuilt, fill = price)) +
geom_histogram(position = "identity", alpha = 0.5)
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p4
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p5
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p6
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It doesn’t seem that any single variable adequately separates the price levels. Next, we explore which pairs of variables better differentiate between them.
pairs(x = data[, c("GrLivArea","GarageArea", "X1stFlrSF", "LotArea", "OverallQual", "YearBuilt")],
col = c("firebrick", "green3")[price], pch = 19)
Next, a graphical exploration of the normality of the individual distributions of our predictors is conducted by plotting histograms and QQ-plots.
Individual distributions
par(mfcol = c(2, 3))
for (k in c(1, 2, 3, 5, 6, 7)) {
  j0 <- names(data)[k]
  x0 <- seq(min(data[, k]), max(data[, k]), length.out = 50)
  for (i in 1:2) {
    i0 <- levels(price)[i]
    x <- data[price == i0, j0]
    hist(x, probability = TRUE, col = grey(0.8), main = paste("Price", i0), xlab = j0)
    lines(x0, dnorm(x0, mean(x), sd(x)), col = "red", lwd = 2)
  }
}
QQ-plot graphics
par(mfrow = c(2, 3))
for (k in c(1, 2, 3, 5, 6, 7)) {
  j0 <- names(data)[k]
  for (i in 1:2) {
    i0 <- levels(price)[i]
    x <- data[price == i0, j0]
    qqnorm(x, main = paste("price", i0, j0), pch = 19, col = i + 1)
    qqline(x)
  }
}
This exploratory analysis can give an idea of whether each variable is approximately normal within each group, but it is always better to perform formal tests of normality.
library(reshape2)
# Shapiro-Wilk p-value for a single vector
shap <- function(x) { shapiro.test(x)$p.value }
datos1 <- data
datos1$price <- price
datos_tidy <- melt(datos1, value.name = "value")
## Using price as id variables
aggregate(value ~ price + variable, data = datos_tidy, FUN = shap)
## price variable value
## 1 high GrLivArea 1.733750e-10
## 2 low GrLivArea 1.327903e-16
## 3 high GarageArea 3.331606e-13
## 4 low GarageArea 3.184144e-12
## 5 high X1stFlrSF 1.213088e-06
## 6 low X1stFlrSF 1.327458e-12
## 7 high LotArea 1.488615e-22
## 8 low LotArea 4.731465e-21
## 9 high OverallQual 1.385541e-18
## 10 low OverallQual 3.517968e-22
## 11 high YearBuilt 7.327021e-30
## 12 low YearBuilt 5.072653e-09
## 13 high SalePrice 1.363801e-24
## 14 low SalePrice 6.489266e-15
As observed in the results above, every p-value obtained is below the selected significance level. Consequently, we are unable to assume normality for any variable.
The MVN package in R provides functions for conducting three commonly used tests to assess multivariate normality. Additionally, it includes functions for analyzing outliers, as multivariate normality can be influenced by the presence of outliers.
suppressMessages(library(MVN))
## Warning: package 'MVN' was built under R version 4.3.2
royston_test <- mvn(data = data, mvnTest = "royston", multivariatePlot = "qq")
royston_test$multivariateNormality
## Test H p value MVN
## 1 Royston 565.0778 1.532065e-117 NO
hz_test <- mvn(data = data, mvnTest = "hz")
hz_test$multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 5.632571 0 NO
So multivariate normality cannot be assumed either.
Although the normality assumption does not hold, we will continue with the analysis, keeping in mind that proceeding as if the condition held may yield results that are not entirely accurate or valid, depending on the context.
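Besides normality, LDA also assumes that the covariance matrices of the groups are equal. Base R has no built-in Box's M test, so the following is a minimal sketch of how it could be computed; `box_m`, `xm` and `grp` are illustrative names, and the toy data simulated here merely stand in for the document's `data` and `price`.

```r
# Minimal sketch of Box's M test for equality of group covariance matrices
# (an LDA assumption). In the document's context one could call it as
# box_m(data[, predictors], price); the names below are illustrative.
box_m <- function(x, grouping) {
  x <- as.matrix(x)
  grouping <- as.factor(grouping)
  g <- nlevels(grouping)
  p <- ncol(x)
  n <- tabulate(grouping)                      # group sizes
  N <- sum(n)
  covs <- lapply(levels(grouping),
                 function(l) cov(x[grouping == l, , drop = FALSE]))
  # Pooled covariance matrix
  S_pool <- Reduce(`+`, Map(function(S, ni) (ni - 1) * S, covs, n)) / (N - g)
  M <- (N - g) * log(det(S_pool)) -
       sum(mapply(function(S, ni) (ni - 1) * log(det(S)), covs, n))
  # Chi-square approximation to the distribution of M
  c1 <- (sum(1 / (n - 1)) - 1 / (N - g)) *
        (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
  stat <- M * (1 - c1)
  df <- p * (p + 1) * (g - 1) / 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Toy check on simulated data with equal covariances
set.seed(1)
grp <- factor(rep(c("a", "b"), each = 100))
xm  <- matrix(rnorm(200 * 3), ncol = 3)
box_m(xm, grp)
```

A small p-value would indicate unequal covariance matrices, favoring QDA over LDA.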
suppressMessages(library(MASS))
modelo_lda <- lda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt, data = data)
modelo_lda
## Call:
## lda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual +
## YearBuilt, data = data)
##
## Prior probabilities of groups:
## high low
## 0.5356164 0.4643836
##
## Group means:
## GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt
## high 1684.239 553.1393 1268.873 9392.579 6.942591 1986.734
## low 1200.428 360.8582 986.655 9392.747 5.141750 1954.392
##
## Coefficients of linear discriminants:
## LD1
## GrLivArea -0.0014546451
## GarageArea -0.0003755866
## X1stFlrSF -0.0004928186
## LotArea 0.0061661599
## OverallQual -0.4184650698
## YearBuilt -0.0177120055
The output of this object displays the prior probabilities of each group (in this case, 0.54 for high prices and 0.46 for low prices), the means of each predictor by group, and the coefficients of the linear discriminant.
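The LD1 coefficients define a linear score for each observation: the predictors are centered and multiplied by the coefficient vector. This can be verified against `predict()`; the sketch below uses the built-in `iris` data as a stand-in for the document's housing data.

```r
library(MASS)

# Sketch: the discriminant scores returned by predict() are just the
# centered predictors multiplied by the LD coefficients (fit$scaling).
# iris is used here as a stand-in data set.
fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
X   <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])

# With default priors (the observed group proportions), the centering
# point used by predict() coincides with the overall column means.
manual <- scale(X, center = colMeans(X), scale = FALSE) %*% fit$scaling
auto   <- predict(fit)$x
max(abs(manual - auto))  # essentially zero
```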
Once the classifier is built, new data can be classified based on its measurements simply by calling the predict function.
library(biotools)
## Warning: package 'biotools' was built under R version 4.3.2
## ---
## biotools version 4.2
pred <- predict(modelo_lda, dimen = 1)
confusionmatrix(price, pred$class)
## new high new low
## high 678 104
## low 71 607
training_error <- mean(price != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 11.986301369863 %"
In this case, the model has a success rate of about 88.0%.
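The training error above is computed on the same data used to fit the model and is therefore optimistic. `MASS::lda` supports leave-one-out cross-validation via its `CV = TRUE` argument; the sketch below uses `iris` as a stand-in, but the same call with `price ~ ... , data = data` would apply in the document's context.

```r
library(MASS)

# Sketch: leave-one-out cross-validated accuracy for LDA.
# With CV = TRUE, lda() returns the held-out class predictions directly.
cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)
cv_acc <- mean(cv_fit$class == iris$Species)
cv_acc
```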
The partimat function from the
klaR package allows for the visualization
of classification boundaries of a linear or quadratic discriminant model
for each pair of predictors. Each color represents a classification
region according to the model, and the centroid of each region, along
with the actual values of the observations, is displayed.
suppressMessages(library(klaR))
## Warning: package 'klaR' was built under R version 4.3.2
partimat(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt,
         data = data, method = "lda", prec = 200,
         image.colors = c("darkgoldenrod1", "snow2"),
         col.mean = "firebrick", nplots.vert = 1, nplots.hor = 3)
As with Linear Discriminant Analysis, Quadratic Discriminant Analysis starts from the graphical exploration of the data and the checks of univariate and multivariate normality and homogeneity of variances, which were already carried out above.
suppressMessages(library(MASS))
qda_model <- qda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt, data = data)
qda_model
## Call:
## qda(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual +
## YearBuilt, data = data)
##
## Prior probabilities of groups:
## high low
## 0.5356164 0.4643836
##
## Group means:
## GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt
## high 1684.239 553.1393 1268.873 9392.579 6.942591 1986.734
## low 1200.428 360.8582 986.655 9392.747 5.141750 1954.392
The output of this object displays the prior probabilities of each group, in this case, 0.54 and 0.46, and the means of each predictor by group.
Once the classifier is built, new data can be classified based on its measurements simply by calling the predict function.
pred <- predict(qda_model, dimen = 1)
confusionmatrix(price, pred$class)
## new high new low
## high 681 101
## low 80 598
training_error <- mean(price != pred$class) * 100
paste("training_error=", training_error, "%")
## [1] "training_error= 12.3972602739726 %"
The percentage of success is around 87.6%.
As before, the partimat function from the klaR package visualizes the classification regions, this time for the quadratic model.
suppressMessages(library(klaR))
partimat(price ~ GrLivArea + GarageArea + X1stFlrSF + LotArea + OverallQual + YearBuilt,
         data = data, method = "qda", prec = 200,
         image.colors = c("darkgoldenrod1", "snow2"),
         col.mean = "firebrick", nplots.vert = 1, nplots.hor = 3)
Hierarchical clustering seeks a hierarchy of groups based on the closeness or similarity of the observations according to the chosen distance. In the agglomerative case, we start by merging the closest observations; the next closest pairs of groups are then merged, building the hierarchy in an ascending manner. This construction can be visualized by means of a dendrogram.
Below we illustrate how the groups are defined: the number of groups corresponds to the number of vertical lines crossed by a horizontal cut of the dendrogram, and the optimal number of groups can be estimated from this same graph.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ✖ tibble::view() masks summarytools::view()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Package required to call 'clusGap' function
library(cluster)
# Package required to call 'get_dist', 'fviz_cluster' and 'fviz_dist' functions
library(factoextra)
# Package required to call 'ggdendrogram' function
library(ggdendro)
## Warning: package 'ggdendro' was built under R version 4.3.2
##
## Attaching package: 'ggdendro'
##
## The following object is masked from 'package:summarytools':
##
## label
# Package required to call 'grid.arrange' function
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
dendrogram <- hclust(dist(data, method = 'euclidean'), method = 'ward.D')
ggdendrogram(dendrogram, rotate = FALSE, labels = FALSE, theme_dendro = TRUE) +
labs(title = "Dendrogram")
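Once the dendrogram is built, explicit cluster labels can be extracted by cutting the tree at a chosen number of groups with `cutree()`. The sketch below uses the built-in `USArrests` data as a stand-in; in the document's context, `cutree(dendrogram, k = 2)` would play the same role.

```r
# Sketch: cut a Ward dendrogram into k = 2 groups and tabulate the sizes.
# USArrests (built into R) is used here as a stand-in data set.
hc  <- hclust(dist(scale(USArrests), method = "euclidean"), method = "ward.D")
grp <- cutree(hc, k = 2)
table(grp)
```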
The R language implements the K-means algorithm in the function of the same name, which takes as input the data and the number of clusters to form (the centers parameter). To address the sensitivity to the choice of initial seed points, the nstart parameter tests multiple initial configurations and keeps the best one; for example, nstart = 25 generates 25 initial configurations. The use of this parameter is recommended.
k3 <- kmeans(data, centers = 3, nstart = 25)
str(k3)
## List of 9
## $ cluster : int [1:1460] 2 3 2 3 2 3 3 2 1 1 ...
## $ centers : num [1:3, 1:7] 1155 1733 1550 325 562 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:7] "GrLivArea" "GarageArea" "X1stFlrSF" "LotArea" ...
## $ totss : num 2.38e+12
## $ withinss : num [1:3] 1.73e+11 8.74e+10 1.34e+11
## $ tot.withinss: num 3.94e+11
## $ betweenss : num 1.98e+12
## $ size : int [1:3] 461 272 727
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
fviz_cluster(k3, data = data)
set.seed(123)
k2 <- kmeans(data, centers = 2, nstart = 25)
k3 <- kmeans(data, centers = 3, nstart = 25)
k4 <- kmeans(data, centers = 4, nstart = 25)
k5 <- kmeans(data, centers = 5, nstart = 25)
# Plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = data) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = data) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = data) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = data) + ggtitle("k = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
set.seed(123)
fviz_nbclust(data, kmeans, method = "wss")
According to the elbow (within-cluster sum of squares) plot, 4 would be the optimal number of clusters.
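The curve behind the elbow plot can also be computed by hand: fit k-means for a range of k and collect the total within-cluster sum of squares. A sketch, using `USArrests` as a stand-in data set:

```r
# Sketch: total within-cluster SS for k = 1..10, the quantity that
# fviz_nbclust(method = "wss") plots. USArrests is a stand-in data set.
set.seed(123)
wss <- sapply(1:10, function(k)
  kmeans(scale(USArrests), centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

The "elbow" is the value of k after which the curve flattens out.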
set.seed(123)
fviz_nbclust(data, kmeans, method = "silhouette")
set.seed(123)
gap_stat <- clusGap(data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
Each method suggests a different number of clusters, but they tend to be around 2, so we will take 2 as the optimal number of clusters.
set.seed(123)
final <- kmeans(data, 2, nstart = 25)
print(final)
## K-means clustering with 2 clusters of sizes 817, 643
##
## Cluster means:
## GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt SalePrice
## 1 1669.197 548.1040 1260.6442 9392.569 6.893643 1985.866 186152.9
## 2 1193.205 356.7898 981.7482 9392.769 5.105920 1953.735 122203.6
##
## Clustering vector:
## [1] 1 1 1 2 1 2 1 1 2 2 2 1 2 1 1 2 2 2 1 2 1 2 1 2 2 1 2 1 1 2 1 2 1 1 1 1 2
## ... (output truncated: 1460 cluster assignments in total)
##
## Within cluster sum of squares by cluster:
## [1] 595618318126 311797081882
## (between_SS / total_SS = 61.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(final, data = data)
data %>%
mutate(Cluster = final$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 2 × 8
## Cluster GrLivArea GarageArea X1stFlrSF LotArea OverallQual YearBuilt SalePrice
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1669. 548. 1261. 9393. 6.89 1986. 186153.
## 2 2 1193. 357. 982. 9393. 5.11 1954. 122204.
As seen in the graph above, we separated the data into 2 clusters. They are not entirely separated, but each cluster is internally fairly homogeneous, as observed in the table above: one groups the observations with higher mean values, while the other groups those with lower values. In both clusters the mean LotArea is almost identical, which may explain why the two clusters overlap.
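Since the clusters appear to track overall price level, a natural final check is to cross-tabulate the k-means assignment against a known grouping. The sketch below does this with `iris` as a stand-in; in the document's context, `table(final$cluster, price)` would play the same role.

```r
# Sketch: compare k-means labels against a known grouping via a
# contingency table. iris is used here as a stand-in data set.
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 25)
table(km$cluster, iris$Species)
```

Rows are cluster labels and columns the known classes; a near-diagonal (or near block-diagonal) table indicates good agreement.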